Project Synopsis:¶
This notebook explores the TFL Cycle Hire dataset , creating a regression model to understand the factors influencing bike rentals in London.
Workflow¶
Data Acquisition & Cleaning
- Load the raw TFL cycle hire dataset.
- Handle missing values and inconsistent data types.
- Create new engineered features (e.g., weekend, COVID flag, daylight time).
Exploratory Data Analysis (EDA)
- Summarize descriptive statistics of bike hires and weather conditions.
- Visualize patterns across time (daily, weekly, monthly, seasonal).
- Explore relationships between bike hires and external factors like temperature, precipitation, and COVID restrictions.
Modeling Approach
- Apply OLS regression to measure correlations and infer causation between weather, time, and external factors on bike hires.
- Experiment with interaction terms and categorical variables (day of week, season, COVID).
- Evaluate model fit using R², Adjusted R², RMSE, and residual diagnostics.
In [200]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates # Import for custom date formatting
import seaborn as sns
from datetime import datetime
from patsy import dmatrices
from skimpy import skim # Used for generating summary statistics of DataFrames
import warnings
import xgboost as xgb
warnings.filterwarnings('ignore')
# Statistical analysis libraries
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.inspection import PartialDependenceDisplay
In [201]:
# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
In [202]:
# =============================================================================
# London Bikes data
# =============================================================================
# We'll use data from http://www.tfl.gov.uk to analyse usage of the London Bike Sharing scheme.
# This data has already been downloaded for you and exists in a CSV (Comma Separated Values)
# file that you have to read in to Python.
# Read the CSV file
bike = pd.read_csv(r"C:\Users\SKEN\my-project\london_bikes.csv")
# Display the first few rows of the dataframe
print("\nFirst few rows of the dataset:")
print(bike.head())
# Create weekend indicator
bike['weekend'] = bike['wday'].isin(['Sat', 'Sun'])
print(f"Dataset shape: {bike.shape}")
print(f"Date range: {bike['date'].min()} to {bike['date'].max()}")
First few rows of the dataset:
date bikes_hired tempmax tempmin temp feelslikemax \
0 2010-07-30T00:00:00Z 6897 21.5 10.7 16.8 21.5
1 2010-07-31T00:00:00Z 5564 23.0 15.0 18.9 23.0
2 2010-08-01T00:00:00Z 4303 21.0 14.0 17.4 21.0
3 2010-08-02T00:00:00Z 6642 20.4 14.3 17.1 20.4
4 2010-08-03T00:00:00Z 7966 21.6 11.6 17.2 21.6
feelslikemin feelslike dew humidity precip precipprob precipcover \
0 10.7 16.8 10.3 68.1 0.000 0 0.00
1 15.0 18.9 14.3 76.3 2.029 100 8.33
2 14.0 17.4 11.5 69.1 0.000 0 0.00
3 14.3 17.1 11.6 71.1 3.635 100 8.33
4 11.6 17.2 11.4 71.4 0.153 100 8.33
preciptype snow snowdepth windgust windspeed winddir sealevelpressure \
0 NaN 0.0 0.0 NaN 19.2 252.2 1014.0
1 rain 0.0 0.0 NaN 19.7 246.0 1010.9
2 NaN 0.0 0.0 NaN 13.4 260.6 1012.7
3 rain 0.0 0.0 NaN 16.2 327.6 1016.3
4 rain 0.0 0.0 NaN 17.6 252.7 1015.3
cloudcover visibility solarradiation solarenergy uvindex severerisk \
0 56.1 22.9 277.5 23.9 9 NaN
1 59.2 19.7 228.0 19.6 7 NaN
2 64.7 25.2 257.8 22.3 8 NaN
3 75.8 24.5 265.9 23.0 7 NaN
4 50.9 21.8 285.3 24.7 8 NaN
sunrise sunset moonphase \
0 2010-07-30T05:17:39Z 2010-07-30T20:50:39Z 0.65
1 2010-07-31T05:19:09Z 2010-07-31T20:49:03Z 0.68
2 2010-08-01T05:20:39Z 2010-08-01T20:47:25Z 0.71
3 2010-08-02T05:22:11Z 2010-08-02T20:45:45Z 0.75
4 2010-08-03T05:23:42Z 2010-08-03T20:44:03Z 0.75
conditions \
0 Partially cloudy
1 Rain, Partially cloudy
2 Partially cloudy
3 Rain, Partially cloudy
4 Rain, Partially cloudy
description \
0 Partly cloudy throughout the day.
1 Partly cloudy throughout the day with rain.
2 Partly cloudy throughout the day.
3 Partly cloudy throughout the day with rain.
4 Becoming cloudy in the afternoon with rain clearing later.
icon month_name month day_of_week wday week year weekend \
0 partly-cloudy-day Jul 7 Fri Fri 30 2010 False
1 rain Jul 7 Sat Sat 30 2010 True
2 partly-cloudy-day Aug 8 Sun Sun 30 2010 True
3 rain Aug 8 Mon Mon 31 2010 False
4 rain Aug 8 Tue Tue 31 2010 False
season_name set
0 Summer train
1 Summer train
2 Summer train
3 Summer train
4 Summer train
Dataset shape: (5268, 41)
Date range: 2010-07-30T00:00:00Z to 2024-12-30T00:00:00Z
Cleaning Data¶
In [203]:
# =============================================================================
# Cleaning our data
# =============================================================================
# Fix dates using pandas, and generate new variables for year, month, month_name, day, and day_of_week
# Convert the 'date' column from text to actual date format so Python knows it's dates
bike['date'] = pd.to_datetime(bike['date'])
# Create a list with the correct order of month abbreviations
# This ensures months appear in calendar order instead of alphabetical
bike = bike.assign(
year=bike['date'].dt.year,
month=bike['date'].dt.month,
month_name=bike['date'].dt.month_name()
)
In [204]:
# Dropping the empty value outliers due to system maintenance in later 2022
bike = bike.drop(bike[bike["date"].isin(["2022-09-09", "2022-09-10", "2022-09-11"])].index)
# Dropping the severe risk column due to it being all nulls
bike = bike.drop(columns="severerisk",axis=1)
# Filling the precip type NaN values with Nil to capture the precipitation effect if needed
bike["preciptype"] = bike["preciptype"].fillna("Nil")
# Checking for other nulls across the columns
bike[bike.isna().any(axis=1)]
Out[204]:
| date | bikes_hired | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | dew | humidity | precip | precipprob | precipcover | preciptype | snow | snowdepth | windgust | windspeed | winddir | sealevelpressure | cloudcover | visibility | solarradiation | solarenergy | uvindex | sunrise | sunset | moonphase | conditions | description | icon | month_name | month | day_of_week | wday | week | year | weekend | season_name | set | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-07-30 00:00:00+00:00 | 6897 | 21.5 | 10.7 | 16.8 | 21.5 | 10.7 | 16.8 | 10.3 | 68.1 | 0.000 | 0 | 0.00 | Nil | 0.0 | 0.0 | NaN | 19.2 | 252.2 | 1014.0 | 56.1 | 22.9 | 277.5 | 23.9 | 9 | 2010-07-30T05:17:39Z | 2010-07-30T20:50:39Z | 0.65 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | July | 7 | Fri | Fri | 30 | 2010 | False | Summer | train |
| 1 | 2010-07-31 00:00:00+00:00 | 5564 | 23.0 | 15.0 | 18.9 | 23.0 | 15.0 | 18.9 | 14.3 | 76.3 | 2.029 | 100 | 8.33 | rain | 0.0 | 0.0 | NaN | 19.7 | 246.0 | 1010.9 | 59.2 | 19.7 | 228.0 | 19.6 | 7 | 2010-07-31T05:19:09Z | 2010-07-31T20:49:03Z | 0.68 | Rain, Partially cloudy | Partly cloudy throughout the day with rain. | rain | July | 7 | Sat | Sat | 30 | 2010 | True | Summer | train |
| 2 | 2010-08-01 00:00:00+00:00 | 4303 | 21.0 | 14.0 | 17.4 | 21.0 | 14.0 | 17.4 | 11.5 | 69.1 | 0.000 | 0 | 0.00 | Nil | 0.0 | 0.0 | NaN | 13.4 | 260.6 | 1012.7 | 64.7 | 25.2 | 257.8 | 22.3 | 8 | 2010-08-01T05:20:39Z | 2010-08-01T20:47:25Z | 0.71 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | August | 8 | Sun | Sun | 30 | 2010 | True | Summer | train |
| 3 | 2010-08-02 00:00:00+00:00 | 6642 | 20.4 | 14.3 | 17.1 | 20.4 | 14.3 | 17.1 | 11.6 | 71.1 | 3.635 | 100 | 8.33 | rain | 0.0 | 0.0 | NaN | 16.2 | 327.6 | 1016.3 | 75.8 | 24.5 | 265.9 | 23.0 | 7 | 2010-08-02T05:22:11Z | 2010-08-02T20:45:45Z | 0.75 | Rain, Partially cloudy | Partly cloudy throughout the day with rain. | rain | August | 8 | Mon | Mon | 31 | 2010 | False | Summer | train |
| 4 | 2010-08-03 00:00:00+00:00 | 7966 | 21.6 | 11.6 | 17.2 | 21.6 | 11.6 | 17.2 | 11.4 | 71.4 | 0.153 | 100 | 8.33 | rain | 0.0 | 0.0 | NaN | 17.6 | 252.7 | 1015.3 | 50.9 | 21.8 | 285.3 | 24.7 | 8 | 2010-08-03T05:23:42Z | 2010-08-03T20:44:03Z | 0.75 | Rain, Partially cloudy | Becoming cloudy in the afternoon with rain clearing later. | rain | August | 8 | Tue | Tue | 31 | 2010 | False | Summer | train |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1580 | 2014-11-26 00:00:00+00:00 | 23096 | 8.9 | 7.7 | 8.3 | 8.9 | 5.2 | 7.3 | 8.1 | 98.4 | 9.630 | 100 | 16.67 | rain | 0.0 | 0.0 | NaN | 15.0 | 91.7 | 1011.3 | 96.3 | 3.2 | 18.2 | 1.5 | 1 | 2014-11-26T07:35:03Z | 2014-11-26T15:55:30Z | 0.14 | Rain, Overcast | Cloudy skies throughout the day with a chance of rain throughout the day. | rain | November | 11 | Wed | Wed | 48 | 2014 | False | Autumn | train |
| 1581 | 2014-11-27 00:00:00+00:00 | 26765 | 10.6 | 8.8 | 9.8 | 10.6 | 7.4 | 8.7 | 9.5 | 98.2 | 1.637 | 100 | 12.50 | rain | 0.0 | 0.0 | NaN | 15.1 | 128.4 | 1003.5 | 84.1 | 6.0 | 36.2 | 3.3 | 2 | 2014-11-27T07:36:35Z | 2014-11-27T15:54:38Z | 0.18 | Rain, Partially cloudy | Partly cloudy throughout the day with a chance of rain throughout the day. | rain | November | 11 | Thu | Thu | 48 | 2014 | False | Autumn | train |
| 1582 | 2014-11-28 00:00:00+00:00 | 26782 | 11.3 | 9.8 | 10.5 | 11.3 | 7.8 | 10.3 | 9.7 | 94.9 | 0.129 | 100 | 4.17 | rain | 0.0 | 0.0 | NaN | 24.2 | 85.7 | 1004.1 | 83.3 | 6.8 | 51.0 | 4.3 | 2 | 2014-11-28T07:38:06Z | 2014-11-28T15:53:48Z | 0.21 | Rain, Partially cloudy | Partly cloudy throughout the day with morning rain. | rain | November | 11 | Fri | Fri | 48 | 2014 | False | Autumn | train |
| 1583 | 2014-11-29 00:00:00+00:00 | 24337 | 11.5 | 6.4 | 9.5 | 11.5 | 4.7 | 8.5 | 8.1 | 91.7 | 0.038 | 100 | 4.17 | rain | 0.0 | 0.0 | NaN | 19.8 | 106.6 | 1010.7 | 74.3 | 7.1 | 45.4 | 3.9 | 2 | 2014-11-29T07:39:35Z | 2014-11-29T15:53:02Z | 0.25 | Rain, Partially cloudy | Partly cloudy throughout the day with early morning rain. | rain | November | 11 | Sat | Sat | 48 | 2014 | True | Autumn | train |
| 1584 | 2014-11-30 00:00:00+00:00 | 19998 | 10.6 | 4.4 | 8.1 | 10.6 | 2.2 | 6.4 | 7.4 | 95.6 | 0.196 | 100 | 8.33 | rain | 0.0 | 0.0 | NaN | 24.0 | 331.2 | 1013.2 | 81.6 | 6.5 | 30.5 | 2.5 | 2 | 2014-11-30T07:41:03Z | 2014-11-30T15:52:19Z | 0.28 | Rain, Partially cloudy | Partly cloudy throughout the day with rain. | rain | November | 11 | Sun | Sun | 48 | 2014 | True | Autumn | train |
1126 rows × 40 columns
Times, Days and Seasons¶
In [205]:
# Define a list of month names
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November', 'December']
# Define a list of day names
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
# Convert the 'day_of_week' column to a categorical type with the specified order
# Categorical = tells pandas this column has a specific order, not just random text
bike['day_of_week'] = pd.Categorical(bike['day_of_week'],
categories=day_order,
ordered=True)
# Convert the 'month_name' column to a categorical type with the specified order
bike['month_name'] = pd.Categorical(bike['month_name'],
categories=month_order,
ordered=True)
# Generate new variable season_name to turn seasons from numbers to Winter, Spring, etc
# This function takes a month name and returns which season it belongs to
def get_season(month_name):
if month_name in ['December', 'January', 'February']: # Winter months
return 'Winter'
elif month_name in ['March', 'April', 'May']: # Spring months
return 'Spring'
elif month_name in ['June', 'July', 'August']: # Summer months
return 'Summer'
else: # Autumn months (Sep, Oct, Nov)
return 'Autumn'
# Apply the season function to each row to create a new 'season_name' column
bike['season_name'] = bike['month_name'].apply(get_season)
# Make season_name an ordered categorical so seasons appear in logical order
bike['season_name'] = pd.Categorical(bike['season_name'],
categories=['Winter', 'Spring', 'Summer', 'Autumn'],
ordered=True)
bike.head()
Out[205]:
| date | bikes_hired | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | dew | humidity | precip | precipprob | precipcover | preciptype | snow | snowdepth | windgust | windspeed | winddir | sealevelpressure | cloudcover | visibility | solarradiation | solarenergy | uvindex | sunrise | sunset | moonphase | conditions | description | icon | month_name | month | day_of_week | wday | week | year | weekend | season_name | set | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-07-30 00:00:00+00:00 | 6897 | 21.5 | 10.7 | 16.8 | 21.5 | 10.7 | 16.8 | 10.3 | 68.1 | 0.000 | 0 | 0.00 | Nil | 0.0 | 0.0 | NaN | 19.2 | 252.2 | 1014.0 | 56.1 | 22.9 | 277.5 | 23.9 | 9 | 2010-07-30T05:17:39Z | 2010-07-30T20:50:39Z | 0.65 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | July | 7 | Fri | Fri | 30 | 2010 | False | Summer | train |
| 1 | 2010-07-31 00:00:00+00:00 | 5564 | 23.0 | 15.0 | 18.9 | 23.0 | 15.0 | 18.9 | 14.3 | 76.3 | 2.029 | 100 | 8.33 | rain | 0.0 | 0.0 | NaN | 19.7 | 246.0 | 1010.9 | 59.2 | 19.7 | 228.0 | 19.6 | 7 | 2010-07-31T05:19:09Z | 2010-07-31T20:49:03Z | 0.68 | Rain, Partially cloudy | Partly cloudy throughout the day with rain. | rain | July | 7 | Sat | Sat | 30 | 2010 | True | Summer | train |
| 2 | 2010-08-01 00:00:00+00:00 | 4303 | 21.0 | 14.0 | 17.4 | 21.0 | 14.0 | 17.4 | 11.5 | 69.1 | 0.000 | 0 | 0.00 | Nil | 0.0 | 0.0 | NaN | 13.4 | 260.6 | 1012.7 | 64.7 | 25.2 | 257.8 | 22.3 | 8 | 2010-08-01T05:20:39Z | 2010-08-01T20:47:25Z | 0.71 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | August | 8 | Sun | Sun | 30 | 2010 | True | Summer | train |
| 3 | 2010-08-02 00:00:00+00:00 | 6642 | 20.4 | 14.3 | 17.1 | 20.4 | 14.3 | 17.1 | 11.6 | 71.1 | 3.635 | 100 | 8.33 | rain | 0.0 | 0.0 | NaN | 16.2 | 327.6 | 1016.3 | 75.8 | 24.5 | 265.9 | 23.0 | 7 | 2010-08-02T05:22:11Z | 2010-08-02T20:45:45Z | 0.75 | Rain, Partially cloudy | Partly cloudy throughout the day with rain. | rain | August | 8 | Mon | Mon | 31 | 2010 | False | Summer | train |
| 4 | 2010-08-03 00:00:00+00:00 | 7966 | 21.6 | 11.6 | 17.2 | 21.6 | 11.6 | 17.2 | 11.4 | 71.4 | 0.153 | 100 | 8.33 | rain | 0.0 | 0.0 | NaN | 17.6 | 252.7 | 1015.3 | 50.9 | 21.8 | 285.3 | 24.7 | 8 | 2010-08-03T05:23:42Z | 2010-08-03T20:44:03Z | 0.75 | Rain, Partially cloudy | Becoming cloudy in the afternoon with rain clearing later. | rain | August | 8 | Tue | Tue | 31 | 2010 | False | Summer | train |
In [206]:
# Adding a new daylight_time column
bike['sunset_dt'] = pd.to_datetime(bike['sunset'])
bike['sunrise_dt'] = pd.to_datetime(bike['sunrise'])
# Daylight_time is calculated as the sunset time minus the sunrise time
bike['daylight_time'] = bike['sunset_dt'] - bike['sunrise_dt']
bike['daylight_time'] = bike['daylight_time'].astype('int64') / 10000000
# Adding a binary variable COVID for the time period that was affected, to capture variability due to Covid.
bike['COVID'] = np.where(
bike['date'].between('2020-03-16', '2022-12-31'),
1,
0
)
# Creating a variable precipitation squared to capture the polynomial effect on the bikes hired
bike['precip2'] = bike['precip'] ** 2
In [207]:
# Inspecting for the shape and overall summary statistics for bikes and bikes_hired
print(bike.info())
print(bike.head())
print("Overall summary statistics for bikes_hired:")
print(bike['bikes_hired'].describe())
skim(bike)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 45 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 5268 non-null datetime64[ns, UTC]
1 bikes_hired 5268 non-null int64
2 tempmax 5268 non-null float64
3 tempmin 5268 non-null float64
4 temp 5268 non-null float64
5 feelslikemax 5268 non-null float64
6 feelslikemin 5268 non-null float64
7 feelslike 5268 non-null float64
8 dew 5268 non-null float64
9 humidity 5268 non-null float64
10 precip 5268 non-null float64
11 precipprob 5268 non-null int64
12 precipcover 5268 non-null float64
13 preciptype 5268 non-null object
14 snow 5268 non-null float64
15 snowdepth 5268 non-null float64
16 windgust 4142 non-null float64
17 windspeed 5268 non-null float64
18 winddir 5268 non-null float64
19 sealevelpressure 5268 non-null float64
20 cloudcover 5268 non-null float64
21 visibility 5268 non-null float64
22 solarradiation 5268 non-null float64
23 solarenergy 5268 non-null float64
24 uvindex 5268 non-null int64
25 sunrise 5268 non-null object
26 sunset 5268 non-null object
27 moonphase 5268 non-null float64
28 conditions 5268 non-null object
29 description 5268 non-null object
30 icon 5268 non-null object
31 month_name 5268 non-null category
32 month 5268 non-null int32
33 day_of_week 5268 non-null category
34 wday 5268 non-null object
35 week 5268 non-null int64
36 year 5268 non-null int32
37 weekend 5268 non-null bool
38 season_name 5268 non-null category
39 set 5268 non-null object
40 sunset_dt 5268 non-null datetime64[ns, UTC]
41 sunrise_dt 5268 non-null datetime64[ns, UTC]
42 daylight_time 5268 non-null float64
43 COVID 5268 non-null int64
44 precip2 5268 non-null float64
dtypes: bool(1), category(3), datetime64[ns, UTC](3), float64(23), int32(2), int64(5), object(8)
memory usage: 1.6+ MB
None
date bikes_hired tempmax tempmin temp \
0 2010-07-30 00:00:00+00:00 6897 21.5 10.7 16.8
1 2010-07-31 00:00:00+00:00 5564 23.0 15.0 18.9
2 2010-08-01 00:00:00+00:00 4303 21.0 14.0 17.4
3 2010-08-02 00:00:00+00:00 6642 20.4 14.3 17.1
4 2010-08-03 00:00:00+00:00 7966 21.6 11.6 17.2
feelslikemax feelslikemin feelslike dew humidity precip precipprob \
0 21.5 10.7 16.8 10.3 68.1 0.000 0
1 23.0 15.0 18.9 14.3 76.3 2.029 100
2 21.0 14.0 17.4 11.5 69.1 0.000 0
3 20.4 14.3 17.1 11.6 71.1 3.635 100
4 21.6 11.6 17.2 11.4 71.4 0.153 100
precipcover preciptype snow snowdepth windgust windspeed winddir \
0 0.00 Nil 0.0 0.0 NaN 19.2 252.2
1 8.33 rain 0.0 0.0 NaN 19.7 246.0
2 0.00 Nil 0.0 0.0 NaN 13.4 260.6
3 8.33 rain 0.0 0.0 NaN 16.2 327.6
4 8.33 rain 0.0 0.0 NaN 17.6 252.7
sealevelpressure cloudcover visibility solarradiation solarenergy \
0 1014.0 56.1 22.9 277.5 23.9
1 1010.9 59.2 19.7 228.0 19.6
2 1012.7 64.7 25.2 257.8 22.3
3 1016.3 75.8 24.5 265.9 23.0
4 1015.3 50.9 21.8 285.3 24.7
uvindex sunrise sunset moonphase \
0 9 2010-07-30T05:17:39Z 2010-07-30T20:50:39Z 0.65
1 7 2010-07-31T05:19:09Z 2010-07-31T20:49:03Z 0.68
2 8 2010-08-01T05:20:39Z 2010-08-01T20:47:25Z 0.71
3 7 2010-08-02T05:22:11Z 2010-08-02T20:45:45Z 0.75
4 8 2010-08-03T05:23:42Z 2010-08-03T20:44:03Z 0.75
conditions \
0 Partially cloudy
1 Rain, Partially cloudy
2 Partially cloudy
3 Rain, Partially cloudy
4 Rain, Partially cloudy
description \
0 Partly cloudy throughout the day.
1 Partly cloudy throughout the day with rain.
2 Partly cloudy throughout the day.
3 Partly cloudy throughout the day with rain.
4 Becoming cloudy in the afternoon with rain clearing later.
icon month_name month day_of_week wday week year weekend \
0 partly-cloudy-day July 7 Fri Fri 30 2010 False
1 rain July 7 Sat Sat 30 2010 True
2 partly-cloudy-day August 8 Sun Sun 30 2010 True
3 rain August 8 Mon Mon 31 2010 False
4 rain August 8 Tue Tue 31 2010 False
season_name set sunset_dt sunrise_dt \
0 Summer train 2010-07-30 20:50:39+00:00 2010-07-30 05:17:39+00:00
1 Summer train 2010-07-31 20:49:03+00:00 2010-07-31 05:19:09+00:00
2 Summer train 2010-08-01 20:47:25+00:00 2010-08-01 05:20:39+00:00
3 Summer train 2010-08-02 20:45:45+00:00 2010-08-02 05:22:11+00:00
4 Summer train 2010-08-03 20:44:03+00:00 2010-08-03 05:23:42+00:00
daylight_time COVID precip2
0 5598000.0 0 0.000000
1 5579400.0 0 4.116841
2 5560600.0 0 0.000000
3 5541400.0 0 13.213225
4 5522100.0 0 0.023409
Overall summary statistics for bikes_hired:
count 5268.000000
mean 26327.596811
std 9496.916456
min 0.000000
25% 19746.750000
50% 26070.000000
75% 32745.000000
max 73094.000000
Name: bikes_hired, dtype: float64
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮ │ Data Summary Data Types Categories │ │ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━┓ │ │ ┃ Dataframe ┃ Values ┃ ┃ Column Type ┃ Count ┃ ┃ Categorical Variables ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ ┡━━━━━━━━━━━━━━━━━━━━━━━┩ │ │ │ Number of rows │ 5268 │ │ float64 │ 23 │ │ month_name │ │ │ │ Number of columns │ 45 │ │ string │ 8 │ │ day_of_week │ │ │ └───────────────────┴────────┘ │ int64 │ 7 │ │ season_name │ │ │ │ datetime64 │ 3 │ └───────────────────────┘ │ │ │ category │ 3 │ │ │ │ bool │ 1 │ │ │ └─────────────┴───────┘ │ │ number │ │ ┏━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓ │ │ ┃ column ┃ NA ┃ NA % ┃ mean ┃ sd ┃ p0 ┃ p25 ┃ p50 ┃ p75 ┃ p100 ┃ hist ┃ │ │ ┡━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩ │ │ │ bikes_hi │ 0 │ 0 │ 26330 │ 9497 │ 0 │ 19750 │ 26070 │ 32740 │ 73090 │ ▁▇▇▃ │ │ │ │ red │ │ │ │ │ │ │ │ │ │ │ │ │ │ tempmax │ 0 │ 0 │ 14.73 │ 6.364 │ -2.6 │ 10.1 │ 14.4 │ 19.5 │ 36.1 │ ▁▅▇▇▂ │ │ │ │ tempmin │ 0 │ 0 │ 7.836 │ 5.039 │ -11.1 │ 4.1 │ 8 │ 11.6 │ 22.1 │ ▁▆▇▆▁ │ │ │ │ temp │ 0 │ 0 │ 11.23 │ 5.481 │ -5.9 │ 7.2 │ 11.1 │ 15.5 │ 28.3 │ ▃▇▇▃ │ │ │ │ feelslik │ 0 │ 0 │ 14.09 │ 7.239 │ -7.2 │ 10.1 │ 14.4 │ 19.5 │ 38.2 │ ▁▅▇▇▂ │ │ │ │ emax │ │ │ │ │ │ │ │ │ │ │ │ │ │ feelslik │ 0 │ 0 │ 6.185 │ 6.376 │ -14.6 │ 1.175 │ 6 │ 11.6 │ 22.1 │ ▂▇▆▇▁ │ │ │ │ emin │ │ │ │ │ │ │ │ │ │ │ │ │ │ feelslik │ 0 │ 0 │ 10.09 │ 6.668 │ -9.1 │ 4.9 │ 10.6 │ 15.5 │ 28.9 │ ▁▅▇▇▅ │ │ │ │ e │ │ │ │ │ │ │ │ │ │ │ │ │ │ dew │ 0 │ 0 │ 7.413 │ 4.672 │ -8 │ 4.1 │ 7.6 │ 11 │ 19 │ ▂▆▇▆▁ │ │ │ │ humidity │ 0 │ 0 │ 79.28 │ 10.4 │ 39.6 │ 72.2 │ 80.3 │ 87.3 │ 99.9 │ ▁▃▇▇▃ │ │ │ │ precip │ 0 │ 0 │ 1.641 │ 4.733 │ 0 │ 0 │ 0.185 │ 1.337 │ 168.2 │ ▇ │ │ │ │ precippr │ 0 │ 0 │ 73.04 │ 44.38 │ 0 │ 0 │ 100 │ 100 │ 100 │ ▃ ▇ │ │ │ │ ob │ │ │ │ │ │ │ │ │ │ │ │ │ │ precipco │ 0 │ 0 │ 8.959 │ 11.6 │ 0 │ 0 │ 8.33 │ 12.5 │ 100 │ ▇▁ │ │ │ │ ver │ │ │ │ │ │ │ │ │ │ │ │ │ │ snow │ 0 │ 0 │ 0.01517 │ 0.2454 │ 0 │ 0 │ 0 │ 0 │ 8.4 │ ▇ │ │ │ │ snowdept │ 0 │ 0 │ 0.04677 │ 0.5492 │ 0 │ 0 │ 0 │ 0 │ 18.1 │ ▇ │ │ │ │ h │ │ │ │ │ │ │ │ │ │ │ │ │ │ windgust │ 1126 │ 21.374335 │ 42.08 │ 14.84 │ 9.4 │ 30.73 │ 40.7 │ 51.8 │ 120.9 │ ▃▇▅▁ │ │ │ │ │ │ 61123766 │ │ │ │ │ │ │ │ │ │ │ │ windspee │ 0 │ 0 │ 22.95 │ 8.421 │ 5.1 │ 16.88 │ 21.8 │ 27.6 │ 72.2 │ ▃▇▃▁ │ │ │ │ d │ │ │ │ │ │ │ │ │ │ │ │ │ │ winddir │ 0 │ 0 │ 197 │ 92.33 │ 0.1 │ 127.8 │ 221.6 │ 259 │ 359.8 │ ▃▃▃▇▇▃ │ │ │ │ sealevel │ 0 │ 0 │ 1015 │ 10.52 │ 966.6 │ 1009 │ 1016 │ 1022 │ 1048 │ ▃▇▅ │ │ │ │ pressure │ │ │ │ │ │ │ │ │ │ │ │ │ │ cloudcov │ 0 │ 0 │ 60.86 │ 22.45 │ 0 │ 46.6 │ 62.7 │ 77.8 │ 100 │ ▁▂▅▇▇▅ │ │ │ │ er │ │ │ │ │ │ │ │ │ │ │ │ │ │ visibili │ 0 │ 0 │ 22.65 │ 10.62 │ 0.2 │ 15.2 │ 22 │ 28.9 │ 62.5 │ ▃▇▇▃▁ │ │ │ │ ty │ │ │ │ │ │ │ │ │ │ │ │ │ │ solarrad │ 0 │ 0 │ 112 │ 84.89 │ 0 │ 40.2 │ 91.5 │ 169.8 │ 358.6 │ ▇▅▃▃▁▁ │ │ │ │ iation │ │ │ │ │ │ │ │ │ │ │ │ │ │ solarene │ 0 │ 0 │ 9.663 │ 7.336 │ 0 │ 3.5 │ 7.9 │ 14.6 │ 31 │ ▇▅▃▃▁▁ │ │ │ │ rgy │ │ │ │ │ │ │ │ │ │ │ │ │ │ uvindex │ 0 │ 0 │ 4.26 │ 2.553 │ 0 │ 2 │ 4 │ 6 │ 10 │ ▅▇▃▆▆▁ │ │ │ │ moonphas │ 0 │ 0 │ 0.4831 │ 0.2888 │ 0 │ 0.25 │ 0.5 │ 0.75 │ 0.98 │ ▇▇▇▇▇▇ │ │ │ │ e │ │ │ │ │ │ │ │ │ │ │ │ │ │ month │ 0 │ 0 │ 6.623 │ 3.456 │ 1 │ 4 │ 7 │ 10 │ 12 │ ▇▇▇▇▇▇ │ │ │ │ week │ 0 │ 0 │ 27.01 │ 15.08 │ 1 │ 14 │ 27 │ 40 │ 53 │ ▇▇▇▇▇▇ │ │ │ │ year │ 0 │ 0 │ 2017 │ 4.169 │ 2010 │ 2014 │ 2017 │ 2021 │ 2024 │ ▆▅▅▇▅▇ │ │ │ │ daylight │ 0 │ 0 │ 4405000 │ 1085000 │ 2803000 │ 3363000 │ 4410000 │ 5441000 │ 6006000 │ ▇▅▅▅▅▇ │ │ │ │ _time │ │ │ │ │ │ │ │ │ │ │ │ │ │ COVID │ 0 │ 0 │ 0.1938 │ 0.3953 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ ▂ │ │ │ │ precip2 │ 0 │ 0 │ 25.09 │ 515.2 │ 0 │ 0 │ 0.03422 │ 1.786 │ 28290 │ ▇ │ │ │ └──────────┴──────┴───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┘ │ │ category │ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓ │ │ ┃ column ┃ NA ┃ NA % ┃ ordered ┃ unique ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩ │ │ │ month_name │ 0 │ 0 │ True │ 12 │ │ │ │ day_of_week │ 0 │ 0 │ True │ 7 │ │ │ │ season_name │ 0 │ 0 │ True │ 4 │ │ │ └──────────────────────────────────┴───────────┴────────────────┴───────────────────────┴────────────────────┘ │ │ bool │ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓ │ │ ┃ column ┃ true ┃ true rate ┃ hist ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩ │ │ │ weekend │ 1506 │ 0.29 │ ▇ ▃ │ │ │ └────────────────────────────┴───────────────────┴──────────────────────────────────┴────────────────────────┘ │ │ datetime │ │ ┏━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓ │ │ ┃ column ┃ NA ┃ NA % ┃ first ┃ last ┃ frequency ┃ │ │ ┡━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩ │ │ │ date │ 0 │ 0 │ 2010-07-30 │ 2024-12-30 │ D │ │ │ │ sunset_dt │ 0 │ 0 │ 2010-07-30 20:50:39 │ 2024-12-30 16:00:26 │ None │ │ │ │ sunrise_dt │ 0 │ 0 │ 2010-07-30 05:17:39 │ 2024-12-30 08:06:26 │ None │ │ │ └─────────────────┴──────┴─────────┴─────────────────────────────┴────────────────────────────┴──────────────┘ │ │ string │ │ ┏━━━━━━━━━━━┳━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┓ │ │ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ chars per ┃ words per ┃ total ┃ │ │ ┃ column ┃ NA ┃ NA % ┃ shortest ┃ longest ┃ min ┃ max ┃ row ┃ row ┃ words ┃ │ │ ┡━━━━━━━━━━━╇━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ │ │ preciptyp │ 0 │ 0 │ Nil │ rain,snow │ Nil │ snow │ 4 │ 1 │ 5268 │ │ │ │ e │ │ │ │ │ │ │ │ │ │ │ │ │ sunrise │ 0 │ 0 │ 2010-07-3 │ 2010-07-3 │ 2010-07-3 │ 2024-12-3 │ 20 │ 1 │ 5268 │ │ │ │ │ │ │ 0T05:17:3 │ 0T05:17:3 │ 0T05:17:3 │ 0T08:06:2 │ │ │ │ │ │ │ │ │ │ 9Z │ 9Z │ 9Z │ 6Z │ │ │ │ │ │ │ sunset │ 0 │ 0 │ 2010-07-3 │ 2010-07-3 │ 2010-07-3 │ 2024-12-3 │ 20 │ 1 │ 5268 │ │ │ │ │ │ │ 0T20:50:3 │ 0T20:50:3 │ 0T20:50:3 │ 0T16:00:2 │ │ │ │ │ │ │ │ │ │ 9Z │ 9Z │ 9Z │ 6Z │ │ │ │ │ │ │ condition │ 0 │ 0 │ Rain │ Snow, │ Clear │ Snow, │ 19.3 │ 2.6 │ 13832 │ │ │ │ s │ │ │ │ Rain, │ │ Rain, │ │ │ │ │ │ │ │ │ │ │ Partially │ │ Partially │ │ │ │ │ │ │ │ │ │ │ cloudy │ │ cloudy │ │ │ │ │ │ │ descripti │ 0 │ 0 │ Clearing │ Clear │ Becoming │ Partly │ 50.5 │ 8.2 │ 43311 │ │ │ │ on │ │ │ in the │ condition │ cloudy in │ cloudy │ │ │ │ │ │ │ │ │ │ afternoon │ s │ the │ throughou │ │ │ │ │ │ │ │ │ │ . │ throughou │ afternoon │ t the │ │ │ │ │ │ │ │ │ │ │ t the day │ with a │ day. │ │ │ │ │ │ │ │ │ │ │ with a │ chance of │ │ │ │ │ │ │ │ │ │ │ │ chance of │ rain or │ │ │ │ │ │ │ │ │ │ │ │ rain or │ snow │ │ │ │ │ │ │ │ │ │ │ │ snow │ throughou │ │ │ │ │ │ │ │ │ │ │ │ throughou │ t the │ │ │ │ │ │ │ │ │ │ │ │ t the │ day. │ │ │ │ │ │ │ │ │ │ │ │ day. │ │ │ │ │ │ │ │ │ icon │ 0 │ 0 │ rain │ partly-cl │ clear-day │ wind │ 7 │ 1 │ 5268 │ │ │ │ │ │ │ │ oudy-day │ │ │ │ │ │ │ │ │ wday │ 0 │ 0 │ Fri │ Fri │ Fri │ Wed │ 3 │ 1 │ 5268 │ │ │ │ set │ 0 │ 0 │ train │ train │ train │ train │ 5 │ 1 │ 5268 │ │ │ └───────────┴────┴──────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴────────────┘ │ ╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
In [208]:
#Summary statistics by different variables and relationships
print("\nSummary statistics by year:")
print(bike.groupby('year')['bikes_hired'].describe())
print("\nSummary statistics by day of week:")
print(bike.groupby('day_of_week')['bikes_hired'].describe())
Summary statistics by year:
count mean std min 25% 50% 75% \
year
2010 155.0 14069.761290 5616.909424 2764.0 9297.00 14010.0 18677.50
2011 365.0 19568.353425 5497.468615 4555.0 16260.00 20264.0 23708.00
2012 366.0 26008.969945 9429.395314 3531.0 19282.00 26178.5 32532.75
2013 365.0 22042.353425 7276.488364 3728.0 17555.00 22021.0 27371.00
2014 365.0 27462.731507 9065.086440 4327.0 20532.00 27676.0 34437.00
2015 365.0 27046.134247 8547.890451 5779.0 22056.00 26618.0 32857.00
2016 366.0 28152.013661 8850.957827 4894.0 22402.00 27881.5 35130.00
2017 365.0 28619.298630 8376.552423 5143.0 24064.00 29490.0 34459.00
2018 365.0 28952.164384 10174.346177 5859.0 21759.00 29190.0 37677.00
2019 365.0 28561.520548 8087.904111 5649.0 24198.00 28927.0 34494.00
2020 366.0 28508.653005 11587.302615 4872.0 20147.25 27508.0 36487.25
2021 365.0 29976.065753 10990.961868 6253.0 21703.00 30988.0 38207.00
2022 365.0 31522.936986 10337.169182 0.0 25096.00 31397.0 40020.00
2023 365.0 23373.063014 6029.498261 6721.0 19949.00 23692.0 27805.00
2024 365.0 23987.378082 6143.741429 7479.0 20436.00 24806.0 28656.00
max
year
2010 27964.0
2011 29417.0
2012 47102.0
2013 35580.0
2014 49025.0
2015 73094.0
2016 46625.0
2017 46035.0
2018 46084.0
2019 44668.0
2020 70170.0
2021 56922.0
2022 66972.0
2023 35319.0
2024 35190.0
Summary statistics by day of week:
count mean std min 25% 50% \
day_of_week
Mon 753.0 26050.938911 8277.117203 3971.0 20644.00 25748.0
Tue 752.0 28178.922872 8618.748160 3763.0 22462.75 27854.0
Wed 752.0 28372.493351 8523.335169 4327.0 22752.75 27956.0
Thu 752.0 28275.652926 8788.769687 5649.0 22657.75 27534.0
Fri 753.0 26983.482072 8544.883666 5402.0 21405.00 26532.0
Sat 753.0 24222.066401 11082.313167 0.0 16009.00 22475.0
Sun 753.0 22217.382470 10498.340099 0.0 14080.00 20971.0
75% max
day_of_week
Mon 31469.00 67034.0
Tue 34435.50 65326.0
Wed 34228.50 54360.0
Thu 34432.75 73094.0
Fri 32857.00 66972.0
Sat 30847.00 70170.0
Sun 29669.00 63116.0
In [209]:
print("\nSummary statistics by month:")
print(bike.groupby('month_name')['bikes_hired'].describe())
print("\nSummary statistics by season:")
print(bike.groupby('season_name')['bikes_hired'].describe())
Summary statistics by month:
count mean std min 25% 50% \
month_name
January 434.0 18652.470046 5992.562937 3728.0 13910.25 18964.0
February 396.0 20312.861111 6372.171831 3531.0 15978.25 20646.5
March 434.0 22710.850230 7589.303757 5062.0 17586.75 23052.5
April 420.0 25999.145238 7614.414499 4872.0 21016.50 25932.5
May 434.0 30424.000000 8443.833748 10684.0 24531.25 29895.0
June 420.0 33403.642857 8593.874083 6061.0 27701.50 33057.5
July 436.0 34687.383028 8346.706529 5564.0 29070.50 35781.5
August 465.0 31372.529032 9596.474662 4303.0 25638.00 32439.0
September 450.0 30364.573333 8112.695324 0.0 24866.75 31168.5
October 465.0 27463.823656 6768.732139 7068.0 23281.00 28015.0
November 450.0 23271.486667 6406.332308 6030.0 18766.50 23685.5
December 464.0 17082.303879 6908.340305 2764.0 11396.25 16652.0
75% max
month_name
January 23331.00 38042.0
February 24553.50 52485.0
March 27191.75 56568.0
April 30838.00 49025.0
May 36045.00 70170.0
June 39235.50 65326.0
July 41336.25 73094.0
August 38252.00 66972.0
September 36088.50 51764.0
October 32568.00 47877.0
November 27944.00 44677.0
December 22908.00 39145.0
Summary statistics by season:
count mean std min 25% 50% \
season_name
Winter 1294.0 18597.568779 6576.148434 2764.0 13696.75 18803.0
Spring 1288.0 26382.116460 8505.519735 4872.0 20758.75 26095.0
Summer 1321.0 33112.380772 8982.562793 4303.0 27571.00 33705.0
Autumn 1365.0 27038.025641 7691.127484 0.0 21886.00 27205.0
75% max
season_name
Winter 23515.75 52485.0
Spring 31343.25 70170.0
Summer 39582.00 73094.0
Autumn 32569.00 51764.0
In [210]:
print("\nSummary statistics by weather type:")
print(bike.groupby('icon')['bikes_hired'].describe())
print("\nSummary statistics by precipitation:")
print(bike.groupby('preciptype')['bikes_hired'].describe())
Summary statistics by weather type:
count mean std min 25% \
icon
clear-day 195.0 35032.994872 9594.133137 10558.0 28971.00
cloudy 96.0 22891.750000 7984.384212 7262.0 17242.75
partly-cloudy-day 1127.0 31646.354037 8937.950401 4303.0 25909.00
rain 3797.0 24581.882802 8735.699590 0.0 18553.00
snow 51.0 12412.039216 6271.264600 2764.0 7071.00
wind 2.0 14437.000000 708.520995 13936.0 14186.50
50% 75% max
icon
clear-day 35196.0 42212.5 63116.0
cloudy 22414.5 28001.0 41447.0
partly-cloudy-day 31834.0 38515.0 70170.0
rain 24370.0 30285.0 73094.0
snow 12300.0 16780.5 28765.0
wind 14437.0 14687.5 14938.0
Summary statistics by precipitation:
count mean std min 25% 50% \
preciptype
Nil 1411.0 31617.082211 9247.522684 4303.0 25546.50 31605.0
rain 3556.0 25000.573678 8722.382286 0.0 18969.25 24898.5
rain,snow 287.0 17300.104530 6669.224668 2764.0 12280.00 18061.0
snow 14.0 15350.500000 5938.452592 5621.0 11722.50 15772.0
75% max
preciptype
Nil 38767.50 70170.0
rain 30784.75 73094.0
rain,snow 22549.00 35267.0
snow 20800.50 22460.0
In [211]:
# Time series plot of bikes rented
# While summary statistics allow us to quickly discover key metrics that represent
# the data, they are often not sufficient by themselves. Often, it is useful to
# represent the data graphically (or visually) to uncover information that is
# critical for decision making, such as trends, patterns and outliers.
# Filter data from January 1, 2014 onwards
bike_filtered = bike[bike['date'] >= pd.to_datetime('2014-01-01', utc=True)].copy()
# Create time series plot showcasing bikes hired across past years
plt.figure(figsize=(15, 6))
sns.scatterplot(data=bike_filtered, x='date', y='bikes_hired', alpha=0.4)
plt.title('Time Series Plot of Bikes Hired (2014 onwards)', fontsize=14, fontweight='bold')
plt.xlabel('Date')
plt.ylabel('Bikes Hired')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
In [212]:
# Creating a histogram plot showcasing the frequency counts of bikes rented every day
sns.histplot(
data=bike, # Specify the DataFrame to use
x='bikes_hired', # Specify the column to plot
bins=40, # Divide the data into 30 bins (bars)
alpha=0.7, # Set transparency (0=invisible, 1=solid)
color='blue', # Set the fill color of the bars
edgecolor='grey', # Add grey lines around each bar for better definition
kde=True # Add a smooth density curve on top of the histogram
)
# Customize plot labels and title
plt.title('Histogram of Bikes Rented', fontsize=16) # Main title
plt.xlabel('Bikes Hired', fontsize=12) # X-axis label
plt.ylabel('Frequency', fontsize=12) # Y-axis label
# Ensure the layout is clean and labels don't overlap
plt.tight_layout()
# Display the plot
plt.show()
In [213]:
# FACETED HISTOGRAM BY SEASON
# Create separate histograms for each season to compare distributions
sns.set_style("whitegrid") # Set consistent style
# Create a grid of histograms, one for each season
g = sns.displot(
data=bike, # The DataFrame to use
x='bikes_hired', # Variable to plot
col='season_name', # Create separate plots for each season
col_wrap=2, # Arrange plots in 2 columns
kind='hist', # Make histogram plots
bins=20, # Use 20 bins per histogram
hue='season_name', # Bar color
edgecolor='black', # Edge color
alpha=0.7 # Transparency
)
# Customize the grid of plots
g.set_axis_labels('Bikes Hired', 'Frequency') # Set labels for all plots
g.set_titles('{col_name}') # Use season name as title for each subplot
g.fig.suptitle('Distribution of Bikes Hired by Season', y=1.03) # Overall title
plt.show()
In [214]:
# Creating a scatterplot to compare distributions of bike hires across weather conditions with weekend-weekday differences
plt.figure(figsize=(14, 7))
sns.stripplot(
data=bike_filtered,
x="icon",
y="bikes_hired",
hue="weekend",
jitter=True,
dodge=True,
alpha=0.6,
palette="Set2"
)
plt.title("Bike Hires by Weather Condition (Scatter, Weekend vs Weekday)", fontsize=16, weight="bold")
plt.xlabel("Weather Condition (Icon)", fontsize=14)
plt.ylabel("Number of Bikes Hired", fontsize=14)
plt.legend(title="Weekend?", fontsize=12, title_fontsize=13)
plt.grid(axis="y", linestyle="--", alpha=0.5)
plt.tight_layout()
plt.show()
In [215]:
# Creating a boxplot to compare distributions of bike hires across months with weekend-weekday differences
plt.figure(figsize=(14, 7))
sns.boxplot(
data=bike_filtered,
x="month_name",
y="bikes_hired",
hue="weekend",
palette="Set3"
)
plt.title("Bike Hires Distribution by Weather Condition (Weekend vs Weekday)", fontsize=16, weight="bold")
plt.xlabel("Month", fontsize=14)
plt.ylabel("Number of Bikes Hired", fontsize=14)
plt.legend(title="Weekend?", fontsize=12, title_fontsize=13)
plt.grid(axis="y", linestyle="--", alpha=0.5)
plt.tight_layout()
plt.show()
In [216]:
# Creating a boxplot to compare distributions of bike hires across precipitation type with weekend-weekday differences
plt.figure(figsize=(14, 7))
sns.boxplot(
data=bike_filtered,
x="preciptype",
y="bikes_hired",
hue="weekend",
palette="Set2"
)
plt.title("Bike Hires Distribution by Weather Condition (Weekend vs Weekday)", fontsize=16, weight="bold")
plt.xlabel("Precipitation Type", fontsize=14)
plt.ylabel("Number of Bikes Hired", fontsize=14)
plt.legend(title="Weekend?", fontsize=12, title_fontsize=13)
plt.grid(axis="y", linestyle="--", alpha=0.5)
plt.tight_layout()
plt.show()
In [217]:
# =============================================================================
# Model building
# =============================================================================
# Correlation - Scatterplot matrix
# Besides the number of bikes rented out on a given day, we have also downloaded
# weather data for London such as mean_temp, humidity, pressure, precipitation, etc.
# that measure weather conditions on a single day. It may be the case that more
# bikes are rented out when it's warmer? Or how can we estimate the effect of rain on rentals?
# Your task is to build a regression model that helps you explain the number of rentals per day.
# Select variables and create a correlation matrix
numerical_vars = ['weekend', 'cloudcover', 'humidity', 'precip', 'temp', 'feelslike', 'bikes_hired']
correlation_data = bike[numerical_vars]
# Create correlation heatmap
plt.figure(figsize=(10, 8))
correlation_matrix = correlation_data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='crest', center=0,
square=True, linewidths=0.5)
plt.title('Correlation Matrix of Numerical Variables', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Create pairplot for scatterplot matrix
plt.figure(figsize=(10, 8))
sns.pairplot(correlation_data,
hue='weekend',
diag_kind='kde')
plt.suptitle('Scatterplot Matrix of Numerical Variables', y=1.02, fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
<Figure size 1000x800 with 0 Axes>
In [218]:
# Select only numeric columns
numeric_df = bike_filtered.select_dtypes(include=['int64', 'float64'])
# Compute correlation matrix
corr = numeric_df.corr()
# Correlation of all variables with bikes_hired
bikes_corr = corr['bikes_hired'].sort_values(ascending=False)
print(bikes_corr)
plt.figure(figsize=(10, 8))
plt.imshow(corr, cmap='magma', interpolation='nearest')
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Correlation Matrix")
plt.show()
bikes_hired 1.000000 tempmax 0.663627 feelslikemax 0.654427 feelslike 0.619992 temp 0.616227 solarradiation 0.596642 solarenergy 0.596239 uvindex 0.587361 daylight_time 0.581021 feelslikemin 0.538198 tempmin 0.505682 dew 0.457466 visibility 0.324185 sealevelpressure 0.270639 COVID 0.169397 week 0.112093 moonphase 0.000097 winddir -0.056441 precip2 -0.065451 snow -0.098805 snowdepth -0.115190 precip -0.228450 windspeed -0.252994 windgust -0.256263 precipprob -0.330499 cloudcover -0.345898 precipcover -0.347468 humidity -0.492140 Name: bikes_hired, dtype: float64
In [219]:
# Weekend or weekdays? Is there a difference?
print("\nT-test for difference between weekend and weekday bike rentals:")
# Split data into two groups
weekend_bikes = bike[bike['weekend'] == True]['bikes_hired']
weekday_bikes = bike[bike['weekend'] == False]['bikes_hired']
# Null hypothesis (H0): mean(weekend) == mean(weekday)
# Alternative hypothesis (H1): mean(weekend) != mean(weekday)
t_stat, p_value = stats.ttest_ind(weekend_bikes, weekday_bikes, equal_var=False) # t-test
# Print results
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Mean weekend rentals: {weekend_bikes.mean():.2f}")
print(f"Mean weekday rentals: {weekday_bikes.mean():.2f}")
alpha = 0.05
if p_value < alpha:
print(f"\nReject the null hypothesis (p < {alpha}).")
print("There is a statistically significant difference in bike rentals between weekends and weekdays.")
else:
print(f"\nFail to reject the null hypothesis (p ≥ {alpha}).")
print("No statistically significant difference found between weekend and weekday rentals.")
T-test for difference between weekend and weekday bike rentals: T-statistic: -13.9286 P-value: 0.0000 Mean weekend rentals: 23219.72 Mean weekday rentals: 27571.74 Reject the null hypothesis (p < 0.05). There is a statistically significant difference in bike rentals between weekends and weekdays.
In [220]:
#Function create to plot values between actuals and predicted
def plot_actual_vs_predicted(model, model_name, dataframe):
"""
Plots actual data vs. model predictions with specific custom styling.
Args:
model (statsmodels.regression.linear_model.RegressionResultsWrapper):
A fitted statsmodels model object.
model_name (str):
A descriptive name for the model to be used in the plot title.
dataframe (pd.DataFrame):
The original dataframe containing the 'date' and 'bikes_hired' columns.
"""
print(f"\n--- Generating plot for {model_name} ---")
# Create a copy of the dataframe and add predictions to ensure alignment
df_plot = dataframe.copy()
df_plot['predicted_hires'] = model.predict(dataframe)
df_plot['date'] = pd.to_datetime(df_plot['date'])
# Create figure with white background (like ggplot's theme_bw())
plt.figure(figsize=(14, 7), facecolor='white')
# Set white background for the plot area
ax = plt.gca()
ax.set_facecolor('white')
# Add grid lines (like theme_bw())
ax.grid(True, color='lightgray', linestyle='-', linewidth=0.5, alpha=0.7)
ax.set_axisbelow(True) # Put grid behind the data points
# Plot the actual and predicted data
sns.scatterplot(
data=df_plot, x='date', y='bikes_hired',
color='black', alpha=0.3, label='Actual Hires'
)
sns.scatterplot(
data=df_plot, x='date', y='predicted_hires',
color='blue', alpha=0.7, label='Predicted Hires'
)
# Remove padding from the x-axis so it starts and ends on the actual data range
ax.set_xlim(df_plot['date'].min(), df_plot['date'].max())
# Set custom date ticks to show every 6 months (Jan and Jul)
ax.xaxis.set_major_locator(mdates.MonthLocator(bymonth=[1, 7]))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b-%y'))
# Improve label readability by rotating them
plt.xticks(rotation=45, ha='right')
# Set black borders around the plot (like theme_bw())
for spine in ax.spines.values():
spine.set_edgecolor('black')
spine.set_linewidth(1)
# Finalize the plot
plt.title(f'Actual vs. Predicted Bike Hires - {model_name}')
plt.xlabel('')
plt.ylabel('Number of Bikes Hired')
plt.legend()
# Adjust layout to prevent label cutoff
plt.tight_layout()
plt.show()
In [221]:
# =============================================================================
# Model 0: using the mean to predict bikes_hired
# =============================================================================
# We start with the naive model where we just use the average to predict how many
# bikes we are going to rent out on a single day
print("Summary statistics for bike_filtered:")
print(bike_filtered['bikes_hired'].describe())
# Create mean model using statsmodels
# --- Model 0: Intercept-only (Naive Mean Model) ---
print("\n--- Model 0: Intercept Only ---")
# Statsmodels Version
model0_sm = smf.ols('bikes_hired ~ 1', data=bike_filtered).fit()
print("Statsmodels - Model 0 Summary:")
print(model0_sm.summary().tables[1])
print(f"Residual SE: {model0_sm.scale ** 0.5:.3f}") # Residual standard error is the square root of the scale (MSE of residuals).
# Plot actual vs predicted for Model 0
#plt(model0_sm, "Model 0: Mean Model", bike_filtered)
# Scikit-learn Version of the same model
model0 = LinearRegression()
y = bike_filtered['bikes_hired']
y_pred_null = np.full_like(y, y.mean()) # Predict the mean for all observations
rmse_null = np.sqrt(mean_squared_error(y, y_pred_null))
print("\nScikit-learn - Model 0 (Null Model):")
print(f" Prediction (Mean): {y.mean():.2f}")
print(f" Root Mean Squared Error: {rmse_null:.2f}")
print(f"Residual SE: {model0_sm.scale ** 0.5:.3f}") # Residual standard error is the square root of the scale (MSE of residuals).
Summary statistics for bike_filtered:
count 4017.000000
mean 27833.153099
std 9357.364234
min 0.000000
25% 21718.000000
50% 27676.000000
75% 34372.000000
max 73094.000000
Name: bikes_hired, dtype: float64
--- Model 0: Intercept Only ---
Statsmodels - Model 0 Summary:
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 2.783e+04 147.640 188.521 0.000 2.75e+04 2.81e+04
==============================================================================
Residual SE: 9357.364
Scikit-learn - Model 0 (Null Model):
Prediction (Mean): 27833.15
Root Mean Squared Error: 9356.20
Residual SE: 9357.364
In [222]:
# Split the data into train and test sets (80% train, 20% test)
train, test = train_test_split(bike_filtered, test_size=0.2, random_state=42)
# Define your model formula (the best OLS)
formula = 'bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2'
# Fit the model on the training set
model_train = smf.ols(formula=formula, data=train).fit()
In [223]:
print("\n--- Model 2: bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---")
# Statsmodels Version
precip2 = bike_filtered["precip"] ** 2
month2 = bike_filtered["month"] ** 2
model1_sm = smf.ols('bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ', data=bike_filtered).fit()
print("Statsmodels - Model 1 Summary:")
print(model1_sm.summary().tables[1])
print("R-squared:", model1_sm.rsquared.round(4))
print("Adjusted R-squared:", model1_sm.rsquared_adj.round(4))
print(f"Residual SE: {model1_sm.scale ** 0.5:.3f}") # Residual standard error is the square root of the scale (MSE of residuals).
# Plot actual vs predicted for Model 1
plot_actual_vs_predicted(model1_sm, "Model 1: Temperature Model", bike_filtered)
# Questions:
# - Is the effect of temp significant? Why?
# - What proportion of the overall variability in bike rentals does temperature explain?
print(f"R-squared: {model1_sm.rsquared:.4f}")
print(f"Adjusted R-squared: {model1_sm.rsquared_adj:.4f}")
--- Model 2: bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---
Statsmodels - Model 1 Summary:
================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------
Intercept 1.544e+04 661.419 23.350 0.000 1.41e+04 1.67e+04
C(COVID)[T.1] 1567.9292 232.103 6.755 0.000 1112.879 2022.980
C(icon)[T.cloudy] -904.1495 804.736 -1.124 0.261 -2481.881 673.582
C(icon)[T.partly-cloudy-day] 865.9369 485.767 1.783 0.075 -86.438 1818.312
C(icon)[T.rain] -157.7997 479.952 -0.329 0.742 -1098.773 783.174
C(icon)[T.snow] -1324.5228 1139.197 -1.163 0.245 -3557.983 908.937
C(weekend)[T.True] -1529.6283 197.488 -7.745 0.000 -1916.814 -1142.443
C(day_of_week)[T.Tue] 2210.7634 342.044 6.463 0.000 1540.167 2881.360
C(day_of_week)[T.Wed] 2504.8051 341.956 7.325 0.000 1834.381 3175.230
C(day_of_week)[T.Thu] 2520.0487 341.932 7.370 0.000 1849.671 3190.427
C(day_of_week)[T.Fri] 1005.5123 341.854 2.941 0.003 335.288 1675.736
C(day_of_week)[T.Sat] 382.7882 197.441 1.939 0.053 -4.306 769.882
C(day_of_week)[T.Sun] -1912.4165 197.472 -9.684 0.000 -2299.572 -1525.261
tempmax 849.3996 16.181 52.493 0.000 817.676 881.124
windspeed -209.1950 11.810 -17.713 0.000 -232.349 -186.041
precip -489.5830 30.772 -15.910 0.000 -549.913 -429.253
visibility 170.6835 9.811 17.397 0.000 151.448 189.919
precip2 2.8301 0.262 10.807 0.000 2.317 3.344
================================================================================================
R-squared: 0.6189
Adjusted R-squared: 0.6174
Residual SE: 5788.322
--- Generating plot for Model 1: Temperature Model ---
R-squared: 0.6189 Adjusted R-squared: 0.6174
In [224]:
print("\n--- Model 2 - Out-of-sample-testing : bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---")
print("\n-Train Dataset")
# Statsmodels Version
precip2 = bike_filtered["precip"] ** 2
month2 = bike_filtered["month"] ** 2
model_sm = smf.ols('bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ', data=bike_filtered).fit()
print("Statsmodels - Model Summary:")
print(model_sm.summary().tables[1])
print("R-squared:", model_sm.rsquared.round(4))
print("Adjusted R-squared:", model_sm.rsquared_adj.round(4))
print(f"Residual SE: {model_sm.scale ** 0.5:.3f}") # Residual standard error is the square root of the scale (MSE of residuals).
print("\n-Test Dataset")
# Fit the model on the training set
model_sm = smf.ols(formula=formula, data=train).fit()
# Predictions on training set
y_train_pred = model_sm.predict(train)
rmse_train = np.sqrt(np.mean((train['bikes_hired'] - y_train_pred) ** 2))
# Predictions on test set
y_test_pred = model_sm.predict(test)
rmse_test = np.sqrt(np.mean((test['bikes_hired'] - y_test_pred) ** 2))
print(model_train.summary2())
print(f"Training RMSE: {rmse_train:.2f}")
print(f"Test RMSE: {rmse_test:.2f}")
--- Model 2 - Out-of-sample-testing : bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---
-Train Dataset
Statsmodels - Model Summary:
================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------
Intercept 1.544e+04 661.419 23.350 0.000 1.41e+04 1.67e+04
C(COVID)[T.1] 1567.9292 232.103 6.755 0.000 1112.879 2022.980
C(icon)[T.cloudy] -904.1495 804.736 -1.124 0.261 -2481.881 673.582
C(icon)[T.partly-cloudy-day] 865.9369 485.767 1.783 0.075 -86.438 1818.312
C(icon)[T.rain] -157.7997 479.952 -0.329 0.742 -1098.773 783.174
C(icon)[T.snow] -1324.5228 1139.197 -1.163 0.245 -3557.983 908.937
C(weekend)[T.True] -1529.6283 197.488 -7.745 0.000 -1916.814 -1142.443
C(day_of_week)[T.Tue] 2210.7634 342.044 6.463 0.000 1540.167 2881.360
C(day_of_week)[T.Wed] 2504.8051 341.956 7.325 0.000 1834.381 3175.230
C(day_of_week)[T.Thu] 2520.0487 341.932 7.370 0.000 1849.671 3190.427
C(day_of_week)[T.Fri] 1005.5123 341.854 2.941 0.003 335.288 1675.736
C(day_of_week)[T.Sat] 382.7882 197.441 1.939 0.053 -4.306 769.882
C(day_of_week)[T.Sun] -1912.4165 197.472 -9.684 0.000 -2299.572 -1525.261
tempmax 849.3996 16.181 52.493 0.000 817.676 881.124
windspeed -209.1950 11.810 -17.713 0.000 -232.349 -186.041
precip -489.5830 30.772 -15.910 0.000 -549.913 -429.253
visibility 170.6835 9.811 17.397 0.000 151.448 189.919
precip2 2.8301 0.262 10.807 0.000 2.317 3.344
================================================================================================
R-squared: 0.6189
Adjusted R-squared: 0.6174
Residual SE: 5788.322
-Test Dataset
Results: Ordinary least squares
=======================================================================================
Model: OLS Adj. R-squared: 0.620
Dependent Variable: bikes_hired AIC: 64757.4969
Date: 2025-09-10 23:23 BIC: 64860.7712
No. Observations: 3213 Log-Likelihood: -32362.
Df Model: 16 F-statistic: 328.8
Df Residuals: 3196 Prob (F-statistic): 0.00
R-squared: 0.622 Scale: 3.2989e+07
---------------------------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 15194.2729 730.8150 20.7909 0.0000 13761.3592 16627.1866
C(COVID)[T.1] 1720.4153 256.0561 6.7189 0.0000 1218.3645 2222.4661
C(icon)[T.cloudy] -93.6921 903.0693 -0.1037 0.9174 -1864.3460 1676.9618
C(icon)[T.partly-cloudy-day] 1095.2609 526.4740 2.0804 0.0376 63.0000 2127.5218
C(icon)[T.rain] 91.5252 521.4307 0.1755 0.8607 -930.8474 1113.8977
C(icon)[T.snow] -620.5594 1282.3316 -0.4839 0.6285 -3134.8353 1893.7165
C(weekend)[T.True] -1611.0289 222.5938 -7.2375 0.0000 -2047.4700 -1174.5877
C(day_of_week)[T.Tue] 2228.8975 379.0936 5.8795 0.0000 1485.6061 2972.1888
C(day_of_week)[T.Wed] 2273.1796 379.9990 5.9821 0.0000 1528.1131 3018.2461
C(day_of_week)[T.Thu] 2575.8420 383.6790 6.7135 0.0000 1823.5600 3328.1240
C(day_of_week)[T.Fri] 949.3550 382.2902 2.4833 0.0131 199.7961 1698.9139
C(day_of_week)[T.Sat] 381.6113 220.8522 1.7279 0.0841 -51.4150 814.6376
C(day_of_week)[T.Sun] -1992.6402 222.9681 -8.9369 0.0000 -2429.8152 -1555.4651
tempmax 850.4917 17.9804 47.3010 0.0000 815.2374 885.7460
windspeed -208.5108 13.1943 -15.8031 0.0000 -234.3809 -182.6407
precip -482.2870 33.7555 -14.2877 0.0000 -548.4716 -416.1024
visibility 168.7889 10.7874 15.6469 0.0000 147.6380 189.9398
precip2 2.7572 0.2759 9.9925 0.0000 2.2162 3.2982
---------------------------------------------------------------------------------------
Omnibus: 202.941 Durbin-Watson: 1.941
Prob(Omnibus): 0.000 Jarque-Bera (JB): 855.464
Skew: -0.128 Prob(JB): 0.000
Kurtosis: 5.515 Condition No.: 404652720913748480
=======================================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The smallest eigenvalue is 8.52e-27. This might indicate that there
are strong multicollinearity problems or that the design matrix is
singular.
Training RMSE: 5728.39
Test RMSE: 5973.22
In [225]:
# After fitting your model (model1_sm)
# Get the design matrix X from the fitted model
X = model1_sm.model.exog
features = model1_sm.model.exog_names
# Drop Intercept if present
if "Intercept" in features:
idx = features.index("Intercept")
X = X[:, [i for i in range(X.shape[1]) if i != idx]]
features = [f for f in features if f != "Intercept"]
# Compute VIF
vif_df = pd.DataFrame({
"feature": features,
"VIF": [variance_inflation_factor(X, i) for i in range(X.shape[1])]
})
print("\n--- Variance Inflation Factor (VIF) ---")
print(vif_df.sort_values(by="VIF", ascending=False).round(3))
--- Variance Inflation Factor (VIF) ---
feature VIF
5 C(weekend)[T.True] inf
11 C(day_of_week)[T.Sun] inf
10 C(day_of_week)[T.Sat] inf
13 windspeed 9.059
3 C(icon)[T.rain] 8.695
15 visibility 7.780
12 tempmax 6.181
2 C(icon)[T.partly-cloudy-day] 3.625
14 precip 3.332
16 precip2 2.862
8 C(day_of_week)[T.Thu] 1.884
7 C(day_of_week)[T.Wed] 1.879
9 C(day_of_week)[T.Fri] 1.864
6 C(day_of_week)[T.Tue] 1.861
0 C(COVID)[T.1] 1.641
1 C(icon)[T.cloudy] 1.162
4 C(icon)[T.snow] 1.111
In [226]:
# Run Breusch-Pagan test
bp_test = het_breuschpagan(model1_sm.resid, model1_sm.model.exog)
# Unpack results
bp_stat, bp_pvalue, f_stat, f_pvalue = bp_test
print("Breusch-Pagan test results:")
print(f" Lagrange Multiplier statistic: {bp_stat:.4f}")
print(f" LM test p-value: {bp_pvalue:.4f}")
print(f" F-statistic: {f_stat:.4f}")
print(f" F-test p-value: {f_pvalue:.4f}")
# Interpretation
alpha = 0.05
if bp_pvalue < alpha:
print("\nReject the null hypothesis: Evidence of heteroskedasticity.")
else:
print("\nFail to reject the null hypothesis: No evidence of heteroskedasticity.")
Breusch-Pagan test results: Lagrange Multiplier statistic: 325.0416 LM test p-value: 0.0000 F-statistic: 22.0101 F-test p-value: 0.0000 Reject the null hypothesis: Evidence of heteroskedasticity.
In [227]:
# Split the data into train and test sets (80% train, 20% test)
train, test = train_test_split(bike_filtered, test_size=0.2, random_state=42)
# Defining model formula (Model 1)
specification = 'bikes_hired ~ tempmax + daylight_time + C(icon) + C(weekend) + C(season_name) + humidity + precipprob'
# Model 2
formula = 'bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2'
# Fit the model on the training set
model_train = smf.ols(formula=formula, data=train).fit()
In [228]:
# Model 1 - Linear Fit
model = smf.ols(specification, data=train).fit()
print(model.summary())
y_pred = model.predict(test) # fit to test data
rmse = np.sqrt(mean_squared_error(test['bikes_hired'], y_pred)) # generate rmse metric
r2 = r2_score(test['bikes_hired'], y_pred) # generate r squared metric
# print
print(f"Test RMSE: {rmse:.2f}")
print(f"Test R²: {r2:.2f}")
# VIF code
y, X = dmatrices(specification, data=train, return_type="dataframe") # build design matrices (y = target, X = predictors including dummies & interactions)
# Compute VIF for each column in X to check
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print("\nVariance Inflation Factors:")
print(vif_data)
# create plot for prediction vs actual
# ensure set to date time
test['date'] = pd.to_datetime(test['date'])
# create copy to avoid disrupting original dataframe
test_sorted = test.copy()
# amend predictions as column
test_sorted['y_pred'] = y_pred
# sort by date for neat look in graph
test_sorted = test_sorted.sort_values('date')
# plotting out the graph
plt.figure(figsize=(12,6))
plt.plot(test_sorted['date'], test_sorted['bikes_hired'], label='Actual', color='blue') # plot actual
plt.plot(test_sorted['date'], test_sorted['y_pred'], label='Predicted', color='red', linestyle='--') # plot predictions
plt.title("Predicted vs Actual Bikes Hired (Test Sample, Model 1)", fontsize=14) # title
plt.xlabel("Date")
plt.ylabel("Bikes Hired")
plt.legend()
plt.show()
OLS Regression Results
==============================================================================
Dep. Variable: bikes_hired R-squared: 0.548
Model: OLS Adj. R-squared: 0.547
Method: Least Squares F-statistic: 353.1
Date: Wed, 10 Sep 2025 Prob (F-statistic): 0.00
Time: 23:23:23 Log-Likelihood: -32649.
No. Observations: 3213 AIC: 6.532e+04
Df Residuals: 3201 BIC: 6.539e+04
Df Model: 11
Covariance Type: nonrobust
================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------
Intercept 2.034e+04 1611.190 12.627 0.000 1.72e+04 2.35e+04
C(icon)[T.cloudy] -868.7907 997.875 -0.871 0.384 -2825.330 1087.748
C(icon)[T.partly-cloudy-day] 263.2170 583.404 0.451 0.652 -880.666 1407.100
C(icon)[T.rain] 1046.4496 629.399 1.663 0.096 -187.616 2280.515
C(icon)[T.snow] -1046.7858 629.463 -1.663 0.096 -2280.978 187.406
C(weekend)[T.True] -4016.6069 246.860 -16.271 0.000 -4500.627 -3532.587
C(season_name)[T.Spring] -1064.5449 537.028 -1.982 0.048 -2117.498 -11.592
C(season_name)[T.Summer] 782.6292 672.384 1.164 0.245 -535.717 2100.975
C(season_name)[T.Autumn] 3817.3756 383.640 9.950 0.000 3065.171 4569.580
tempmax 433.2920 33.653 12.875 0.000 367.308 499.276
daylight_time 0.0026 0.000 9.800 0.000 0.002 0.003
humidity -111.6164 15.008 -7.437 0.000 -141.042 -82.191
precipprob -33.6238 8.783 -3.828 0.000 -50.846 -16.402
==============================================================================
Omnibus: 172.279 Durbin-Watson: 1.935
Prob(Omnibus): 0.000 Jarque-Bera (JB): 578.010
Skew: 0.175 Prob(JB): 3.07e-126
Kurtosis: 5.048 Cond. No. 1.10e+21
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.51e-26. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Test RMSE: 6433.71
Test R²: 0.54
Variance Inflation Factors:
feature VIF
0 Intercept 211.829516
1 C(icon)[T.cloudy] 1.537673
2 C(icon)[T.partly-cloudy-day] 4.918740
3 C(icon)[T.rain] inf
4 C(icon)[T.snow] inf
5 C(weekend)[T.True] 1.001384
6 C(season_name)[T.Spring] 4.396925
7 C(season_name)[T.Summer] 7.001070
8 C(season_name)[T.Autumn] 2.245775
9 tempmax 3.699342
10 daylight_time 6.928360
11 humidity 2.045534
12 precipprob inf
In [229]:
# Model 2 - OLS along with polynomial relationship
print("\n--- Model 2: bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---")
# Redefining precip 2 to be safe
precip2 = bike_filtered["precip"] ** 2
# Statsmodels Version
model2_sm = smf.ols('bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ', data=bike_filtered).fit()
# Printing out results data
print("Statsmodels - Model 2 Summary:")
print(model2_sm.summary().tables[1])
print("R-squared:", model2_sm.rsquared.round(4))
print("Adjusted R-squared:", model2_sm.rsquared_adj.round(4))
print(f"Residual SE: {model2_sm.scale ** 0.5:.3f}") # Residual standard error is the square root of the scale (MSE of residuals).
# Plot actual vs predicted for Model 1
plot_actual_vs_predicted(model2_sm, "Model 2: Non-linear Model", bike_filtered)
--- Model 2: bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---
Statsmodels - Model 2 Summary:
================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------
Intercept 1.544e+04 661.419 23.350 0.000 1.41e+04 1.67e+04
C(COVID)[T.1] 1567.9292 232.103 6.755 0.000 1112.879 2022.980
C(icon)[T.cloudy] -904.1495 804.736 -1.124 0.261 -2481.881 673.582
C(icon)[T.partly-cloudy-day] 865.9369 485.767 1.783 0.075 -86.438 1818.312
C(icon)[T.rain] -157.7997 479.952 -0.329 0.742 -1098.773 783.174
C(icon)[T.snow] -1324.5228 1139.197 -1.163 0.245 -3557.983 908.937
C(weekend)[T.True] -1529.6283 197.488 -7.745 0.000 -1916.814 -1142.443
C(day_of_week)[T.Tue] 2210.7634 342.044 6.463 0.000 1540.167 2881.360
C(day_of_week)[T.Wed] 2504.8051 341.956 7.325 0.000 1834.381 3175.230
C(day_of_week)[T.Thu] 2520.0487 341.932 7.370 0.000 1849.671 3190.427
C(day_of_week)[T.Fri] 1005.5123 341.854 2.941 0.003 335.288 1675.736
C(day_of_week)[T.Sat] 382.7882 197.441 1.939 0.053 -4.306 769.882
C(day_of_week)[T.Sun] -1912.4165 197.472 -9.684 0.000 -2299.572 -1525.261
tempmax 849.3996 16.181 52.493 0.000 817.676 881.124
windspeed -209.1950 11.810 -17.713 0.000 -232.349 -186.041
precip -489.5830 30.772 -15.910 0.000 -549.913 -429.253
visibility 170.6835 9.811 17.397 0.000 151.448 189.919
precip2 2.8301 0.262 10.807 0.000 2.317 3.344
================================================================================================
R-squared: 0.6189
Adjusted R-squared: 0.6174
Residual SE: 5788.322
--- Generating plot for Model 2: Non-linear Model ---
In [230]:
# Model 2 Out-of-sample Testing
print("\n--- Model 2 - Out-of-sample-testing : bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---")
# Training dataset
print("\n-Train Dataset")
# Statsmodels Version
model_sm = smf.ols('bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ', data=bike_filtered).fit()
print("Statsmodels - Model Summary:")
print(model_sm.summary().tables[1])
# Printing results data
print("R-squared:", model_sm.rsquared.round(4))
print("Adjusted R-squared:", model_sm.rsquared_adj.round(4))
print(f"Residual SE: {model_sm.scale ** 0.5:.3f}") # Residual standard error is the square root of the scale (MSE of residuals).
# Testing dataset
print("\n-Test Dataset")
# Fit the model on the training set
model_sm = smf.ols(formula=formula, data=train).fit()
# Predictions on training set
y_train_pred = model_sm.predict(train)
rmse_train = np.sqrt(np.mean((train['bikes_hired'] - y_train_pred) ** 2))
# Predictions on test set
y_test_pred = model_sm.predict(test)
rmse_test = np.sqrt(np.mean((test['bikes_hired'] - y_test_pred) ** 2))
# Printing results data
print(model_train.summary2())
print(f"Training RMSE: {rmse_train:.2f}")
print(f"Test RMSE: {rmse_test:.2f}")
--- Model 2 - Out-of-sample-testing : bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---
-Train Dataset
Statsmodels - Model Summary:
================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------
Intercept 1.544e+04 661.419 23.350 0.000 1.41e+04 1.67e+04
C(COVID)[T.1] 1567.9292 232.103 6.755 0.000 1112.879 2022.980
C(icon)[T.cloudy] -904.1495 804.736 -1.124 0.261 -2481.881 673.582
C(icon)[T.partly-cloudy-day] 865.9369 485.767 1.783 0.075 -86.438 1818.312
C(icon)[T.rain] -157.7997 479.952 -0.329 0.742 -1098.773 783.174
C(icon)[T.snow] -1324.5228 1139.197 -1.163 0.245 -3557.983 908.937
C(weekend)[T.True] -1529.6283 197.488 -7.745 0.000 -1916.814 -1142.443
C(day_of_week)[T.Tue] 2210.7634 342.044 6.463 0.000 1540.167 2881.360
C(day_of_week)[T.Wed] 2504.8051 341.956 7.325 0.000 1834.381 3175.230
C(day_of_week)[T.Thu] 2520.0487 341.932 7.370 0.000 1849.671 3190.427
C(day_of_week)[T.Fri] 1005.5123 341.854 2.941 0.003 335.288 1675.736
C(day_of_week)[T.Sat] 382.7882 197.441 1.939 0.053 -4.306 769.882
C(day_of_week)[T.Sun] -1912.4165 197.472 -9.684 0.000 -2299.572 -1525.261
tempmax 849.3996 16.181 52.493 0.000 817.676 881.124
windspeed -209.1950 11.810 -17.713 0.000 -232.349 -186.041
precip -489.5830 30.772 -15.910 0.000 -549.913 -429.253
visibility 170.6835 9.811 17.397 0.000 151.448 189.919
precip2 2.8301 0.262 10.807 0.000 2.317 3.344
================================================================================================
R-squared: 0.6189
Adjusted R-squared: 0.6174
Residual SE: 5788.322
-Test Dataset
Results: Ordinary least squares
=======================================================================================
Model: OLS Adj. R-squared: 0.620
Dependent Variable: bikes_hired AIC: 64757.4969
Date: 2025-09-10 23:23 BIC: 64860.7712
No. Observations: 3213 Log-Likelihood: -32362.
Df Model: 16 F-statistic: 328.8
Df Residuals: 3196 Prob (F-statistic): 0.00
R-squared: 0.622 Scale: 3.2989e+07
---------------------------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 15194.2729 730.8150 20.7909 0.0000 13761.3592 16627.1866
C(COVID)[T.1] 1720.4153 256.0561 6.7189 0.0000 1218.3645 2222.4661
C(icon)[T.cloudy] -93.6921 903.0693 -0.1037 0.9174 -1864.3460 1676.9618
C(icon)[T.partly-cloudy-day] 1095.2609 526.4740 2.0804 0.0376 63.0000 2127.5218
C(icon)[T.rain] 91.5252 521.4307 0.1755 0.8607 -930.8474 1113.8977
C(icon)[T.snow] -620.5594 1282.3316 -0.4839 0.6285 -3134.8353 1893.7165
C(weekend)[T.True] -1611.0289 222.5938 -7.2375 0.0000 -2047.4700 -1174.5877
C(day_of_week)[T.Tue] 2228.8975 379.0936 5.8795 0.0000 1485.6061 2972.1888
C(day_of_week)[T.Wed] 2273.1796 379.9990 5.9821 0.0000 1528.1131 3018.2461
C(day_of_week)[T.Thu] 2575.8420 383.6790 6.7135 0.0000 1823.5600 3328.1240
C(day_of_week)[T.Fri] 949.3550 382.2902 2.4833 0.0131 199.7961 1698.9139
C(day_of_week)[T.Sat] 381.6113 220.8522 1.7279 0.0841 -51.4150 814.6376
C(day_of_week)[T.Sun] -1992.6402 222.9681 -8.9369 0.0000 -2429.8152 -1555.4651
tempmax 850.4917 17.9804 47.3010 0.0000 815.2374 885.7460
windspeed -208.5108 13.1943 -15.8031 0.0000 -234.3809 -182.6407
precip -482.2870 33.7555 -14.2877 0.0000 -548.4716 -416.1024
visibility 168.7889 10.7874 15.6469 0.0000 147.6380 189.9398
precip2 2.7572 0.2759 9.9925 0.0000 2.2162 3.2982
---------------------------------------------------------------------------------------
Omnibus: 202.941 Durbin-Watson: 1.941
Prob(Omnibus): 0.000 Jarque-Bera (JB): 855.464
Skew: -0.128 Prob(JB): 0.000
Kurtosis: 5.515 Condition No.: 404652720913748480
=======================================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The smallest eigenvalue is 8.52e-27. This might indicate that there
are strong multicollinearity problems or that the design matrix is
singular.
Training RMSE: 5728.39
Test RMSE: 5973.22
In [231]:
# VIF Tests (Model 2)
# We carried out VIF testing using the statsmodel library, rightaway from the fitted models
X = model1_sm.model.exog
# Defining features as the variables we used to predict in our model
features = model1_sm.model.exog_names
# Compute VIF stats
vif_df = pd.DataFrame({
"feature": features,
"VIF": [variance_inflation_factor(X, i) for i in range(X.shape[1])]
})
# Printing out results
print("\n--- Variance Inflation Factor (VIF) ---")
print(vif_df.sort_values(by="VIF", ascending=False).round(3))
--- Variance Inflation Factor (VIF) ---
feature VIF
11 C(day_of_week)[T.Sat] inf
6 C(weekend)[T.True] inf
12 C(day_of_week)[T.Sun] inf
0 Intercept 52.451
4 C(icon)[T.rain] 5.851
3 C(icon)[T.partly-cloudy-day] 5.060
15 precip 3.017
17 precip2 2.857
8 C(day_of_week)[T.Wed] 1.717
9 C(day_of_week)[T.Thu] 1.717
10 C(day_of_week)[T.Fri] 1.716
7 C(day_of_week)[T.Tue] 1.715
2 C(icon)[T.cloudy] 1.534
16 visibility 1.410
5 C(icon)[T.snow] 1.306
13 tempmax 1.241
1 C(COVID)[T.1] 1.224
14 windspeed 1.185
In [232]:
# Model 3: XGBoost with the same specification
target = "bikes_hired" # define target
features = ["tempmax", "daylight_time", "icon", "weekend", "season_name", "humidity", "precipprob","year"] # define features
X = bike_filtered[features].copy() # duplicate dataset
y = bike_filtered[target] # rename target by convention
# xgboost cannot handle categorical variables, so must one hot encode
X = pd.get_dummies(X, columns=["icon", "weekend", "season_name"], drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123) # train test split
# Fit model
model = xgb.XGBRegressor(
n_estimators=500, # number of trees
learning_rate=0.05, # within regular range
max_depth=6, # maximum 6 layers of trees
subsample=0.8, # each tree trained on 80% of data
colsample_bytree=0.8, # tree sees 80% of features
random_state=123, # for reproducibility
gamma=1.0, # minimum loss reduction for a split (prevent overfitting)
reg_lambda=1.0, # L2 regularization (prevent overfitting)
reg_alpha=0.1, # L1 regularization (prevent overfitting)
)
# Fit the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test) # generate predictions as before
# Recompute metrics as before
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.2f}")
print(f"Test R²: {r2:.2f}")
# Replicate code for graphing
test_plot = test.copy()
test_plot["y_pred"] = model.predict(X.loc[test.index]) # align predictions with test set rows
test_plot["date"] = pd.to_datetime(test_plot["date"])
test_plot = test_plot.sort_values("date")
plt.figure(figsize=(12,6))
plt.plot(test_plot["date"], test_plot["bikes_hired"], label="Actual", color="blue")
plt.plot(test_plot["date"], test_plot["y_pred"], label="Predicted", color="red", linestyle="--")
plt.title("Predicted vs Actual Bikes Hired (Model 3)", fontsize=14)
plt.xlabel("Date")
plt.ylabel("Bikes Hired")
plt.legend()
plt.show()
# Most important features plot
xgb.plot_importance(model, max_num_features=10)
plt.title("Top 10 Important Features (XGBoost)")
plt.show()
# Partial dependence plot
features_to_plot = ["tempmax","weekend_True","humidity","year"] # selected most important features
fig, ax = plt.subplots(figsize=(10, 8))
PartialDependenceDisplay.from_estimator(model, X_train, features_to_plot, grid_resolution=50, ax=ax)
fig.suptitle("Partial Dependence Plots for Most Important Features", fontsize=16, y=1.02)
plt.show()
Test RMSE: 4941.74 Test R²: 0.75